This document is structured as follows:

Research question to answer

Train and test MSEP performances

A controlled setup

  • Number of joint components across the columns
  • Dimensionality of \(x\) is 200
  • Sample sizes and noise proportions across rows
  • Black line represents median true test error
  • Main conclusion: high noise and a small sample size lead to overfitting of (O2)PLS
  • In that case, the MSEP is dominated by noise, as the predictable part is very small
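The setup above can be sketched in base R. The parameter values and the generating mechanism below are illustrative assumptions, not the actual simulation design used for the figures:

```r
# Hypothetical sketch of one simulated run: r joint components link the
# 200-dimensional x to y; the noise proportion controls how much of y
# is unpredictable. All values here are illustrative assumptions.
set.seed(1)
N     <- 50    # sample size (varies across rows in the figure)
p     <- 200   # dimensionality of x
r     <- 2     # number of joint components (varies across columns)
noise <- 0.5   # proportion of noise variance in y

Tt <- matrix(rnorm(N * r), N, r)               # joint scores
W  <- qr.Q(qr(matrix(rnorm(p * r), p, r)))     # orthonormal x-loadings
X  <- Tt %*% t(W) + matrix(rnorm(N * p, sd = 0.1), N, p)
signal <- rowSums(Tt)                          # predictable part of y
y  <- signal + rnorm(N, sd = sqrt(noise / (1 - noise) * var(signal)))
```

With high noise and small \(N\), a fitted model can track the noise in the training set, which is the overfitting pattern visible in the boxplots below.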
library(ggplot2)   # plotting
library(dplyr)     # filter, group_by, summarise
library(magrittr)  # %<>% used below
library(plotly)    # ggplotly
load('~/MEGANZ/LUMC dingen/LUMC/PhD/Paper 4 PO2PLS/PO2PLS_Software/R2outp_all_comps_nois_topss.RData')
p <- ggplot(data = R2outp %>% filter(p == "200"), aes(x = method, y = sqrt(value))) +
  geom_boxplot(aes(col = type %>% factor(c('train', 'test')))) +
  geom_hline(yintercept = 1, col = "gray", lty = 2) + 
  geom_hline(data = R2outp %>% filter(method == "TRuE" & type == "test" & p == "200") %>%
               group_by(N, noise, nr_comp) %>% summarise(avg = median(sqrt(value))), 
             aes(yintercept = avg)) +
  facet_grid(N*noise ~ nr_comp, scales = 'free') +
  theme_bw() + scale_x_discrete("Method") + scale_y_continuous("RMSEP") +
  theme(axis.title = element_text(face="bold", size=16)) +
  scale_color_discrete("Type")
ggplotly(p)

Difference in error PO2PLS - O2PLS

  • Same data as above, differently represented
  • Difference in prediction error computed within each simulated run
  • A positive difference (PO2PLS minus O2PLS) indicates that O2PLS had the lower error
  • The gray dashed line marks zero difference
R2diff <- R2outp %>% 
  filter(method == "po2m") %>% 
  select(-method) %>% 
  rename(value.po2m = value) %>% 
  # pair with the O2PLS results row-wise (assumes identical run ordering)
  bind_cols(R2outp %>% 
              filter(method == "o2m") %>% 
              transmute(value.o2m = value))

R2diff %<>% mutate(dif = value.po2m - value.o2m)
p <- R2diff %>% ggplot(aes(x=nr_comp, y=dif)) + 
  geom_boxplot(aes(col=type %>% factor(c('train','test')))) + 
  facet_grid(N*noise ~ p, scales = 'free') + 
  geom_hline(yintercept = 0, lty=2, col="gray")
ggplotly(p)
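As an aside, the `bind_cols` pairing above relies on the `po2m` and `o2m` rows appearing in exactly the same order. A key-based merge avoids that assumption; a base-R sketch on a hypothetical toy data frame (the key columns `run` and `type` are assumptions, the real data may carry additional keys):

```r
# Hypothetical toy version of R2outp: one row per method per run.
R2outp_toy <- data.frame(
  run    = rep(1:3, times = 2),
  type   = "test",
  method = rep(c("po2m", "o2m"), each = 3),
  value  = c(1.0, 1.2, 0.9, 1.1, 1.3, 1.0)
)

# Merge on the keys instead of binding by position.
wide <- merge(
  subset(R2outp_toy, method == "po2m", select = -method),
  subset(R2outp_toy, method == "o2m",  select = -method),
  by = c("run", "type"), suffixes = c(".po2m", ".o2m")
)
wide$dif <- wide$value.po2m - wide$value.o2m   # negative favours PO2PLS
```

Because the pairing is keyed, this stays correct even if the two method subsets are stored in different row orders.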

Conclusions MSEP comparison

  • Overfitting is prominent in the small \(N\), large-noise case
  • This effect is exaggerated with a larger number of components
  • Test error is almost always in favor of PO2PLS, except in the large \(N\), low-dimensionality case

Correct top 25% in very high dimensions

p1 <- ggplot(data=topss%>%filter(p=="200"), aes(x=method, y=sqrt(value))) +
  geom_boxplot(aes(col = method)) +
  geom_hline(yintercept = 1, col = "gray", lty=2) + 
  facet_grid(N*noise ~ nr_comp*p, scales = 'free') +
  theme_bw() + scale_x_discrete("Method") + scale_y_continuous("TPR") +
  theme(axis.title = element_text(face="bold", size=16))

ggplotly(p1)
topssdiff <- topss %>% 
  filter(method == "po2m") %>% 
  select(-method) %>% 
  rename(value.po2m = value) %>% 
  # pair with the O2PLS results row-wise (assumes identical run ordering)
  bind_cols(topss %>% 
              filter(method == "o2m") %>% 
              transmute(value.o2m = value))
topssdiff %<>% mutate(dif = value.po2m - value.o2m)

p2 <- topssdiff %>% ggplot(aes(x=nr_comp, y=dif)) + 
  geom_boxplot(aes(col=nr_comp)) + 
  facet_grid(N*noise ~ p, scales = 'free') + 
  geom_hline(yintercept = 0, lty=2, col="gray")
ggplotly(p2)

Conclusions top 25%

  • The difference is most affected by noise level and number of components (in favor of PO2PLS)
  • PO2PLS attains the higher TPR in most runs
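The TPR itself is computed upstream of this section; under a common definition (the fraction of the truly largest 25% of loadings, by absolute value, that also rank in the estimated top 25%) it could be sketched as follows. The function name and the exact cutoff convention are assumptions:

```r
# Hedged sketch of a top-25% true-positive rate: how many of the truly
# top-25% loadings (by absolute value) also rank in the top 25% of the
# estimated loadings. The actual metric in the paper may differ in detail.
top25_tpr <- function(true_loadings, est_loadings) {
  k <- ceiling(0.25 * length(true_loadings))
  true_top <- order(abs(true_loadings), decreasing = TRUE)[seq_len(k)]
  est_top  <- order(abs(est_loadings),  decreasing = TRUE)[seq_len(k)]
  length(intersect(true_top, est_top)) / k
}
```

A perfect recovery gives a TPR of 1; an estimate whose large loadings land entirely on the wrong variables gives 0.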

Train and test errors of covariance blocks Sx, Sxy, Sy